Content

Focus on data pre-processing of AIMS / DIMS data:

  • Different software solutions
  • Recommendations for handling direct MS towards qualitative and quantitative data

Content

  • Exploratory MS data analysis
  • Challenges to non-chromatographic MS data
  • Peak picking solutions
    • Vendor-based (Progenesis)
    • LC-MS and single spectrum-based processing (xcms)

  • Feature annotation (MS1)
  • Quality assessment of features
  • Alternative approaches to AIMS data pre-processing

Exploratory MS data analysis

Data visualisation

Data consist of either a single mass spectrum or multiple mass spectra acquired over a short timespan.

Plekhova, V., et al. Nature protocols. 2021

Data visualisation

Sample measurement leads to a recognisable ‘ambient’ peak. Mass spectra before and after are mainly noise from chemical and instrumental origin.

Plekhova, V., et al. Nature protocols. 2021

Data visualisation

Mass peaks of biological origin within the ‘ambient’ peak. However, large variability across scans, negatively influencing reproducibility.

Plekhova, V., et al. Nature protocols. 2021

Data visualisation

Scan-to-scan variability limits reproducibility. Possible remedies:

  • Optimise MS acquisition parameters
  • Improve ionisation efficiency
  • Increase scan time

Peak picking across scan(s)

  • Single spectrum-based
  • Across (part of) the ambient peak

Challenges to non-chromatographic data

Limitations of AIMS data

  • Identification of region of interest (single scan or across peak)
  • No chromatographic separation: feature identification impractical
  • Mass deviation shifts (intra- and inter- batch differences)
  • Lock mass correction (can influence ionisation efficiency)
  • Feature correspondence between batches and in real-time setting

Limitations of AIMS data

  • Region of interest → highest ion count mass spectrum
  • Mass deviation shifts (intra- and inter- batch differences) → binning
However,
  1. loss of information when using a single scan, plus variability across scans (especially with short scan times)
  2. distortion of the accurate mass, rendering annotation near impossible
  3. without a lock mass compound, no correction for large inter-batch shifts (± 100-200 ppm)
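
The second drawback can be made concrete with a short calculation. This is a minimal sketch, assuming that a binned peak is reported at the centre of its bin, so the worst-case error is half the bin width; the example m/z values are illustrative.

```r
# Worst-case mass error introduced by fixed-width binning, assuming each
# peak is reported at the centre of its bin
bin_size <- 0.01                      # bin width in m/z units
mz <- c(150.0, 333.0641, 900.0)       # illustrative m/z values

# Half a bin width, expressed in ppm of each m/z value
max_err_ppm <- (bin_size / 2) / mz * 1e6

# Bin centre that each m/z value is mapped to
bin_centre <- floor(mz / bin_size) * bin_size + bin_size / 2

print(data.frame(mz, bin_centre, max_err_ppm))
```

With a 0.01 bin width the worst-case error is in the tens of ppm at low m/z, which is far wider than the windows typically used for accurate-mass annotation.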

Binning of AIMS data

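library(xcms); library(Spectra); library(magrittr)
## Spectra of the third file; bin the scan with the highest TIC at 0.01 m/z
sps <- raw %>% filterFile(3) %>% spectra
sps.binned <- Spectra::bin(sps[which.max(tic(sps))], binSize = 0.01)
peaks <- sps.binned %>% peaksData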

Binning of AIMS data

  1. No distinction between biological and noise mass peaks
  2. Feature information and accurate masses are lost
  3. Feature filtering on minimal intensity and CV
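
The filtering step in point 3 can be sketched in base R as follows. The helper name `filter_features`, the thresholds `min_int` and `max_cv`, and the toy intensity matrix are all assumptions made for illustration, not values from these slides.

```r
# Hypothetical helper: keep features (rows) whose mean intensity is at least
# `min_int` and whose coefficient of variation (CV) across replicates is at
# most `max_cv`
filter_features <- function(x, min_int = 1000, max_cv = 0.3) {
  mean_int <- rowMeans(x)
  cv <- apply(x, 1, sd) / mean_int
  keep <- mean_int >= min_int & cv <= max_cv
  x[keep, , drop = FALSE]
}

# Toy data: three binned m/z values measured in four replicate injections
x <- rbind(
  "333.06" = c(7200, 6800, 7100, 6900),  # intense and reproducible
  "333.13" = c(1400,  200, 2600,  800),  # highly variable, likely noise
  "332.95" = c( 300,  280,  310,  290)   # reproducible but low intensity
)

filtered <- filter_features(x)
print(rownames(filtered))
```

Both thresholds are data dependent and would normally be tuned on quality control samples.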

Peak picking solutions

Vendor-based peak picking solutions

  • Waters instruments: Progenesis
  • Agilent instruments: MassHunter

Vendor-based peak picking solutions

LA-REIMS data: Waters .RAW files with Progenesis QI

  • Developed for LC-MS applications
  • Offers an option for direct sample analysis, but peak picking likely still follows an LC-MS approach

  • Black box approach
  • Accurate masses distorted after batch processing
  • Little to no user control on peak picking

Vendor-based peak picking solutions

  • Software fully integrated in the instruments, streamlined workflow
  • Easily obtainable software and vendor supported (updates)
  • User friendly and no expert knowledge required
  • No custom approaches for different data, offers a one size fits all solution
  • Other options: open-source universal software for metabolomics (MS-DIAL)

Vendor-based peak picking solutions

  • Methodology likely based on chromatographic dogma
  • Algorithms unknown, rendering comparison between instruments/labs difficult
  • Each software has its own tailored solution
  • Non-flexible solution for dynamic mass shifts (real-time analyses)
  • Impractical integration of multiple batches (features characterized by different m/z)
  • Processing may lead to inclusion of noisy features
  • Traceability between raw and processed data difficult

Peak detection using xcms

Peak detection using the centWave (continuous wavelet transform) algorithm on centroided data.

Peak detection using xcms

## centWave settings
cwp <- CentWaveParam(peakwidth = c(2, 8), noise = 100, ppm = 50, snthresh = 0.5,
                     mzdiff = 0.05, fitgauss = TRUE, extendLengthMSW = TRUE,
                     firstBaselineCheck = FALSE, prefilter = c(1, 100))

## Detect peaks in the chromatograms of the four samples
chr_1 <- findChromPeaks(chr_1, cwp)
chromPeaks(chr_1)
##       rt rtmin  rtmax    into    intb    maxo sn sample
##   11.154 8.109 13.183 8.6e+06 7.0e+06 3.1e+06  3      1
##    8.110 6.080 11.154 8.8e+06 7.2e+06 3.7e+06  4      2
##   12.168 9.124 14.198 1.1e+07 8.9e+06 4.0e+06  3      3
##   11.154 8.109 13.183 1.2e+07 9.0e+06 3.7e+06  3      4

Peak detection using xcms

‘Chromatographic peaks’ are not consistently identified, despite good mass peak shape in individual mass spectra.

chr_3 <- findChromPeaks(chr_3, cwp)
chromPeaks(chr_3)
  • Only detected in two out of four samples

Peak detection using xcms

Peak detection on the four files yields a total of only 2620 mass peaks.

  • The distribution of detected peaks across samples is very skewed
  • The majority is detected in only two samples
cwt <- findChromPeaks(cwt, param=cwp)
peaks <- as.data.frame(chromPeaks(cwt))
table(peaks$sample)
##    1    2    3    4 
##  489 1005  823  303

Correspondence in xcms

After correspondence a total of 655 features are identified.

## Perform the correspondence using fixed m/z bin sizes.
pdp <- PeakDensityParam(sampleGroups = sampleData(cwt)$sample_group,
                        minFraction = 0.4, bw = 30)
cwt <- groupChromPeaks(cwt, param = pdp)

featureDefinitions(cwt) |> head()

The majority of features have 2 to 4 ‘chromatographic’ peaks assigned.

table(featureDefinitions(cwt)$npeaks)
##   2   3   4   5   6 
## 298 169 157  22   9

Correspondence in xcms

A minority of features is defined by more than 4 mass peaks.

##              mz    mzmin    mzmax     rt rtmin  rtmax    into    intb    maxo sn sample
## CP0420 333.0618 333.0604 333.0641 11.154 8.109 14.198 6.8e+04 4.2e+04 1.8e+04  3      1
## CP0599 333.1297 333.1277 333.1352  7.095 5.065  8.110 1.4e+03 1.3e+03 8.0e+02 10      2
## CP1451 333.0641 333.0641 333.0641  9.124 6.080 12.169 7.2e+04 4.4e+04 1.9e+04  3      2
## CP2231 332.9529 332.9520 332.9557 12.168 9.124 14.198 2.6e+04 1.4e+04 9.2e+03  2      3
## CP2232 333.0641 333.0641 333.0641 12.168 9.124 15.213 7.2e+04 4.3e+04 2.0e+04  3      3
## CP2458 332.9518 332.9482 332.9632 11.154 8.109 16.227 2.6e+04 1.1e+04 7.7e+03  1      4

CP0599: noise signal picked up as chromatographic peak outside of main peak region.

In the same feature, two distinct mass peaks (m/z 332.95 and m/z 333.06) are grouped together.
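
The size of this problem can be checked from the m/z values listed for this feature. The sketch below computes the ppm spread within the feature and splits it wherever neighbouring m/z values differ by more than a tolerance. The 20 ppm tolerance is an assumption chosen for illustration, not a setting from these slides.

```r
# Relative mass difference in parts per million
ppm_diff <- function(mz, ref) (mz - ref) / ref * 1e6

# m/z values of the six chromatographic peaks assigned to the feature
mz <- c(CP0420 = 333.0618, CP0599 = 333.1297, CP1451 = 333.0641,
        CP2231 = 332.9529, CP2232 = 333.0641, CP2458 = 332.9518)

# Total spread of the feature relative to its median m/z
spread <- ppm_diff(max(mz), median(mz)) - ppm_diff(min(mz), median(mz))
print(round(spread))

# Split the feature wherever consecutive sorted m/z values differ by more
# than the assumed tolerance
tol_ppm <- 20
s <- sort(mz)
cluster <- cumsum(c(TRUE, diff(s) / head(s, -1) * 1e6 > tol_ppm))
groups <- split(names(s), cluster)
print(groups)
```

A spread of several hundred ppm within one feature indicates that the grouping was driven by the retention time dimension rather than by mass accuracy.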

Grouping direct injection MS data with xcms

  • Peak identification on a single spectrum in m/z dimension
  • Alternatively: peak detection on composite spectrum of multiple scans

Peak detection using the MassSpecWavelet package (on raw data). A total of 11057 peaks are identified, for 4416 of which an S/N metric could be calculated.

  • The S/N can be used as a filter to separate biological from noise peaks (but it is not selective)
msw <- MSWParam(scales = c(0.1, 0.2, 0.4, 0.8, 1, 2, 4, 8), nearbyPeak = TRUE,
                winSize.noise = 500, SNR.method = "data.mean", snthresh = 2,
                ampTh = 0.00005, peakScaleRange = 2, ridgeLength = 24)

raw <- findChromPeaks(single, param=msw)
peaks <- as.data.frame(chromPeaks(raw))

Mass peaks are more uniformly detected in all samples.

Grouping direct injection MS data with xcms

A total of 2931 features were found in the spectra.

prm <- MzClustParam(sampleGroups = sampleData(raw)$sample_group)
raw <- groupChromPeaks(raw, param = prm)

features <- featureDefinitions(raw)
table(features$npeaks)
##    2    3    4 
##  558  593 1780

When removing the chromatographic peaks for which no S/N could be calculated, only 1212 features are found.

##   2   3   4 
## 367 366 479

Feature annotation

  • Multiple false hits in case of large ppm windows
  • Highly accurate exact mass required
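
The effect of the ppm window on the number of hits can be illustrated with a small sketch. The three candidate compounds (all with nominal mass 180) and the measured neutral mass are assumptions chosen for this example; adduct handling is left out for brevity.

```r
# Monoisotopic masses of the elements used below
mono <- c(C = 12, H = 1.00782503, N = 14.0030740, O = 15.9949146)

# Illustrative candidate list: elemental compositions of three compounds
# with nominal mass 180 (selection is an assumption for this example)
candidates <- list(
  glucose      = c(C = 6, H = 12, O = 6),
  aspirin      = c(C = 9, H = 8,  O = 4),
  theophylline = c(C = 7, H = 8,  N = 4, O = 2)
)
cand_mass <- sapply(candidates, function(f) sum(f * mono[names(f)]))

# Names of all candidates within `ppm` of the measured neutral mass
match_ppm <- function(measured, ppm) {
  err <- abs(cand_mass - measured) / cand_mass * 1e6
  names(cand_mass)[err <= ppm]
}

measured <- 180.0634  # hypothetical measured neutral mass

hits_wide   <- match_ppm(measured, ppm = 150)
hits_mid    <- match_ppm(measured, ppm = 10)
hits_narrow <- match_ppm(measured, ppm = 3)
print(list(wide = hits_wide, mid = hits_mid, narrow = hits_narrow))
```

The wider the window, the more candidates match, so the uncorrected mass shifts discussed earlier translate directly into ambiguous annotations.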

Quality assessment of features

Whether a feature is of biological origin can be judged from its peak shape:

Kumler, W., et al. BMC Bioinformatics. 2023

  • Peak should be Gaussian shaped
  • Signal-to-noise ratio cut-off
  • Signal background subtraction with baseline intensity
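
The three criteria above can be sketched with two simple metrics. This is a simplified illustration and not the implementation of Kumler et al.; the helper names `gauss_similarity` and `simple_snr`, the moment-based Gaussian fit and the simulated profiles are all assumptions.

```r
# Hypothetical helper: Pearson correlation between a peak profile and a
# Gaussian whose centre and width come from the intensity-weighted moments
gauss_similarity <- function(x, y) {
  y <- y - min(y)                              # crude baseline subtraction
  mu <- sum(x * y) / sum(y)
  sigma <- sqrt(sum(y * (x - mu)^2) / sum(y))
  cor(y, dnorm(x, mean = mu, sd = sigma))
}

# Hypothetical helper: apex height above the median baseline, divided by a
# robust estimate of the baseline spread
simple_snr <- function(y) (max(y) - median(y)) / mad(y)

x <- seq(332.90, 333.20, by = 0.005)           # m/z axis of a toy profile
set.seed(1)
peak  <- 1e4 * dnorm(x, 333.06, 0.01) + rnorm(length(x), sd = 50) + 100
noise <- runif(length(x), 0, 400)

peak_score  <- gauss_similarity(x, peak)
noise_score <- gauss_similarity(x, noise)
snr_peak    <- simple_snr(peak)
snr_noise   <- simple_snr(noise)
print(c(peak = peak_score, noise = noise_score))
print(c(peak = snr_peak, noise = snr_noise))
```

A feature would be retained only if both metrics exceed cut-offs chosen on the data at hand.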

Custom approaches

  1. Region of interest identification: wavelet, similarity, TIC, mass peaks
  2. Lock mass-free alignment procedure (inter- and intra- batch)
  3. Feature quantification
  4. Compound annotation (adducts and isotopes)
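
As an illustration of step 4, isotopic species can be searched for by their characteristic mass spacing. The sketch below looks for 13C isotope partners, assuming singly charged ions; the helper name `find_isotope_pairs`, the 5 ppm tolerance and the peak list are illustrative assumptions.

```r
# Mass difference between 13C and 12C
c13_delta <- 1.0033548

# Hypothetical helper: all pairs of peaks separated by one 13C spacing
# within `ppm`, assuming singly charged ions
find_isotope_pairs <- function(mz, ppm = 5) {
  pairs <- expand.grid(mono = seq_along(mz), iso = seq_along(mz))
  expected <- mz[pairs$mono] + c13_delta
  err <- abs(mz[pairs$iso] - expected) / mz[pairs$iso] * 1e6
  hit <- pairs[err <= ppm, ]
  data.frame(mono = mz[hit$mono], iso = mz[hit$iso])
}

# Toy peak list (illustrative values)
mz <- c(255.2330, 256.2364, 281.2486, 333.0641)
iso_pairs <- find_isotope_pairs(mz)
print(iso_pairs)
```

Adducts can be handled the same way by replacing the isotope spacing with the mass difference between two adduct forms.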

Lock mass-free approach to peak alignment

  • Minor discrepancies between spectra render comparative analysis difficult
  • Identification of ‘hook’ points consistently present in all mass spectra
  • Correct variations between spectra with a spectrum-specific correction

Brochu, F., et al. Scientific reports. 2019

Lock mass-free approach to peak alignment

  • Define spectral representation points for each distinct mass peak

Brochu, F., et al. Scientific reports. 2019

Lock mass-free approach to peak alignment

  • Group mass peaks within allowable mass window of the mass peak

Brochu, F., et al. Scientific reports. 2019
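
The grouping step can be sketched in base R. This is a simplified illustration rather than the published algorithm: centroids of all spectra are pooled, neighbouring m/z values within a ppm window are clustered, and only clusters containing exactly one peak from every spectrum are kept as hook points. The helper name `find_hook_points`, the 40 ppm window and the toy spectra are assumptions.

```r
# Hypothetical helper: returns the mean m/z of every cluster that contains
# exactly one centroid from each spectrum
find_hook_points <- function(spectra, ppm = 40) {
  mz  <- unlist(spectra, use.names = FALSE)
  src <- rep(seq_along(spectra), lengths(spectra))
  o   <- order(mz)
  mz  <- mz[o]
  src <- src[o]
  # Start a new cluster whenever the gap to the previous peak exceeds `ppm`
  cluster <- cumsum(c(TRUE, diff(mz) / head(mz, -1) * 1e6 > ppm))
  complete <- tapply(src, cluster, function(s)
    length(s) == length(spectra) && !anyDuplicated(s))
  centres <- tapply(mz, cluster, mean)
  unname(as.vector(centres[as.vector(complete)]))
}

# Toy centroided spectra (m/z values only, illustrative)
spectra <- list(
  c(255.2325, 281.2481, 333.0618),
  c(255.2330, 281.2490, 333.1297, 333.0641),
  c(255.2319, 333.0641)
)

hooks <- find_hook_points(spectra)
print(hooks)
```

Peaks missing from any spectrum, or clusters with two peaks from the same spectrum, are rejected because they cannot serve as unambiguous reference points.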

Lock mass-free approach to peak alignment

  • Inter-batch mass differences need consideration prior to alignment

Brochu, F., et al. Scientific reports. 2019

Custom approaches

  • Both centroided and raw (profile) spectra are required (mass alignment, feature annotation, …)
## Location of the vendor files and of the ProteoWizard msconvert executable
input <- 'E:/UGent_LIMET/02_Mass_spectrometry/raw_files/saliva_demo'
path_msconvert <- 'C:/Program Files/ProteoWizard/ProteoWizard 3.0.24045.2c2c542/msconvert.exe'

## Output folders for the raw (profile) and centroided mzML files
raw <- 'E:/UGent_LIMET/02_Mass_spectrometry/mzML_files/raw/saliva_demo'
centroided <- 'E:/UGent_LIMET/02_Mass_spectrometry/mzML_files/centroided/saliva_demo'
## Centroided conversion with wavelet-based (cwt) peak picking
msconvert(input, centroided, path_msconvert, processed='cwt', mz=0.025, snr=2.0, 
          filter='absolute', orientation='most-intense', threshold=200, dir=TRUE,
          verbose=TRUE)
## Conversion without any processing
msconvert(input, raw, path_msconvert, processed='none', dir=TRUE, verbose=TRUE)

Region of interest identification and digital lock masses

  • Identification of ‘hook’ points consistently present in all spectra
  • Select one or a few similar spectra that are consistent in peak composition (> 20 samples)
cent <- readMsExperiment(spectraFiles = fls)
indcs <- scan_selection(cent, method='tic', write=TRUE)

Spectral representation and feature correspondence

  • Determine allowable mass window for correspondence
  • Determine spectral representations for all features

Inter-batch mass deviation correction

  • Correct any mass discrepancies between samples or larger cohorts
deviation <- calculate_mz_shift(cent, refPoints=ref_pts, indcs=indcs, 
                                shift='loess', dev=c(-150,150))
centroided <- correct_mz_drift(cent, factor=deviation)

Spectrum-specific mass correction

  • Centroided peaks might fall outside of allowable mass window
  • Apply spectrum-specific correction based on hook points
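
A spectrum-specific correction of this kind can be sketched as follows. This is a minimal illustration, assuming that the ppm error measured at the hook points can be linearly interpolated across the m/z range; the helper name `correct_spectrum` and all m/z values are hypothetical.

```r
# Hypothetical helper: correct the m/z values of one spectrum using the
# ppm error observed at its hook points
correct_spectrum <- function(mz, hooks_obs, hooks_ref) {
  ppm_err <- (hooks_obs - hooks_ref) / hooks_ref * 1e6
  # Interpolate the error between hook points; rule = 2 holds the nearest
  # value constant outside the range covered by the hooks
  err_at_mz <- approx(hooks_obs, ppm_err, xout = mz, rule = 2)$y
  mz / (1 + err_at_mz * 1e-6)
}

# Toy example: a spectrum shifted by a constant +20 ppm
hooks_ref <- c(255.2330, 333.0641, 465.3044)   # reference hook points
hooks_obs <- hooks_ref * (1 + 20e-6)           # hooks as seen in this spectrum
mz_obs    <- c(300, 500) * (1 + 20e-6)         # other peaks in this spectrum

mz_corrected <- correct_spectrum(mz_obs, hooks_obs, hooks_ref)
print(mz_corrected)
```

Because the correction is derived per spectrum, it can follow drifts that a single batch-level shift would miss.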

Estimating mass peak boundaries

  • Creating a composite spectrum yields better estimates of the peak boundaries and of the accurate mass of each feature
composite <- composite_spectrum(raw, indcs, algnPoints=algn_pts, combine='avg', 
                                normalise1='is', normalise2='tic') 

Feature quantification

  • Region of interest
    • Single scan (i.e. the scan with the highest summed intensity)
    • Across (part of) the ambient peak
  • Mass peak signal:
    • Centroided intensity
    • Integrated peak area
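
The two signal definitions can be compared on a simulated profile peak. This is a minimal sketch; the helper name `integrate_peak`, the peak boundaries and the simulated Gaussian peak are assumptions made for illustration.

```r
# Hypothetical helper: trapezoidal integration of a profile peak between
# two m/z boundaries
integrate_peak <- function(mz, intensity, lower, upper) {
  keep <- mz >= lower & mz <= upper
  x <- mz[keep]
  y <- intensity[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Simulated profile peak centred at m/z 333.06 with a width (sd) of 0.01
mz <- seq(333.00, 333.12, by = 0.002)
intensity <- 2e4 * exp(-0.5 * ((mz - 333.06) / 0.01)^2)

apex <- max(intensity)                                   # apex intensity
area <- integrate_peak(mz, intensity, 333.03, 333.09)    # integrated area
print(c(apex = apex, area = area))
```

The apex intensity depends on a single data point, whereas the area uses all points between the boundaries, which is why the choice affects the reproducibility of feature abundances.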

Compound annotation

  • Compound identification only possible with exact mass
  • Lock mass calibration necessary (e.g. a matrix-specific component)
composite <- recalibrate_masses(composite=composite, lockmass=215.0327891, 
                                snr=1, q_score=0.5, ppm=3.4e-5)

Take home messages

  1. Consider ease of use, sensitivity and selectivity when choosing a data pre-processing methodology
  2. Be aware of inter- and intra-batch mass deviations
  3. A highly accurate estimate of the exact mass is needed for putative annotation
    • Supported by adducts and isotopic species
    • Feature filtering
  4. Scan selection and peak integration affect reproducibility of feature abundances
  5. Look at your raw data and play with it!
